
 unannotated data



Improving the learning process and providing more accurate similarity matrices for unannotated data can positively

Neural Information Processing Systems

We sincerely thank the reviewers for their valuable comments. We have proofread the paper and fixed the errors mentioned. Related Work: Thank you for the additional references. We will include and discuss them in the revised version. Code release: Upon acceptance of our paper, we will publicly release the source code.


RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Zare, Ali; Niu, Yulei; Ayyubi, Hammad; Chang, Shih-fu

arXiv.org Artificial Intelligence

Procedure planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite rapid progress on this task, several critical challenges remain: (1) Adaptive procedures: prior works make the unrealistic assumption that the number of action steps is known and fixed, which yields models that do not generalize to real-world scenarios where the sequence length varies. (2) Temporal relation: understanding the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: annotating instructional videos with step-level labels (i.e., timestamp) or sequence-level labels (i.e., action category) is demanding and labor-intensive, which limits generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines the conclusion of actions using an auto-regressive model architecture. For temporal relation, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revise the generated procedures. To tackle the high annotation cost, RAP uses weakly supervised learning to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on the CrossTask and COIN benchmarks show the superiority of RAP over traditional fixed-length models, establishing it as a strong baseline solution for adaptive procedure planning.
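
Two of the ideas in the abstract above, stopping adaptively instead of emitting a fixed number of steps and looking up state-action pairs in an external memory, can be illustrated with a toy sketch. This is not the RAP implementation: the abstract gives no architectural details, so the `END` token, the cosine-similarity lookup, the `plan` loop, and the hand-written `toy_step` function with its two-dimensional "states" are all assumptions made purely for illustration.

```python
import math

END = "<end>"  # hypothetical stop token; the abstract does not name one


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory, state, k=1):
    """Return the actions of the k memory entries whose state is most similar."""
    ranked = sorted(memory, key=lambda pair: cosine(pair[0], state), reverse=True)
    return [action for _, action in ranked[:k]]


def plan(step_fn, start, goal, max_steps=10):
    """Generate actions one at a time until step_fn emits END or max_steps is hit."""
    actions, state = [], start
    for _ in range(max_steps):
        action, state = step_fn(state, goal, actions)
        if action == END:
            break
        actions.append(action)
    return actions


# Toy data (assumed): a memory of (state, action) pairs and a fixed state transition.
memory = [((1.0, 0.0), "pour water"), ((0.0, 1.0), "stir")]
transitions = {(1.0, 0.0): (0.0, 1.0), (0.0, 1.0): (1.0, 1.0)}


def toy_step(state, goal, history):
    """Stand-in for a learned model: stop at the goal, otherwise copy the retrieved action."""
    if state == goal:
        return END, state
    action = retrieve(memory, state, k=1)[0]
    return action, transitions[state]


print(plan(toy_step, (1.0, 0.0), (1.0, 1.0)))  # ['pour water', 'stir']
print(plan(toy_step, (1.0, 0.0), (0.0, 1.0)))  # ['pour water']
```

The same loop yields plans of different lengths for different goals, which is the property the adaptive setting asks for. In a real system the step function would be a learned auto-regressive model over visual features rather than a lookup table.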